Preliminaries
We now present some technical results that will be used repeatedly in the rest of the paper. A direct corollary of the Chernoff-Hoeffding bound (see, e.g., [...]). We also use the following variation of the Chernoff bound for sampling without replacement. We provide the lower bound proofs for the results in Section 4. We remark that these lower bounds [...]. In particular, lower bounds for offline multi-armed bandits are often information-theoretic and do not depend on adversarial instances. By Yao's minimax principle, it suffices to prove the lower bound for deterministic algorithms over [...] instances. We remark that even with the random arrival of arms, the sample lower bound in Theorem 1 still holds.
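For reference, a standard form of the Chernoff-Hoeffding bound alluded to above, stated here for i.i.d. samples $X_1, \dots, X_m$ bounded in $[0,1]$ with mean $\mu$ (the without-replacement variant mentioned in the excerpt is of the same shape but is not reproduced here), is
\[
\Pr\left[\left|\frac{1}{m}\sum_{i=1}^{m} X_i - \mu\right| \ge \varepsilon\right] \le 2\exp\left(-2 m \varepsilon^{2}\right).
\]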
Secure Best Arm Identification in the Presence of a Copycat
Consider the problem of best arm identification with a security constraint. Specifically, assume a setup of stochastic linear bandits with $K$ arms of dimension $d$. In each arm pull, the player receives a reward that is the sum of the dot product of the arm with an unknown parameter vector and independent noise. The player's goal is to identify the best arm after $T$ arm pulls. Moreover, assume a copycat Chloe is observing the arm pulls. The player wishes to keep Chloe ignorant of the best arm. While a minimax-optimal algorithm identifies the best arm with an $\Omega\left(\frac{T}{\log(d)}\right)$ error exponent, it easily reveals its best-arm estimate to an outside observer, as the best arms are played more frequently. A naive secure algorithm that plays all arms equally results in an $\Omega\left(\frac{T}{d}\right)$ exponent. In this paper, we propose a secure algorithm that plays with \emph{coded arms}. The algorithm does not require any key or cryptographic primitives, yet achieves an $\Omega\left(\frac{T}{\log^2(d)}\right)$ exponent while revealing almost no information on the best arm.
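As a point of reference for the baselines discussed in the abstract, the sketch below (hypothetical Python names, using NumPy) illustrates the naive secure strategy of playing all arms equally: the pull sequence is a fixed round robin, so it reveals nothing about the best arm, and the parameter vector is estimated by least squares at the end. Here `theta_star` is used only to simulate rewards; this is not the paper's coded-arms algorithm.

```python
import numpy as np

def naive_secure_bai(arms, theta_star, T, noise_std=1.0, rng=None):
    # Hypothetical sketch of the naive secure baseline from the abstract:
    # play all K arms equally often (a fixed round robin leaks nothing about
    # the best arm), then estimate the unknown parameter by least squares.
    # `arms` is a (K, d) array of arm vectors; `theta_star` is used here only
    # to simulate the linear rewards with Gaussian noise.
    rng = np.random.default_rng() if rng is None else rng
    K, _ = arms.shape
    X, y = [], []
    for t in range(T):
        a = t % K                                    # uniform, best-arm-oblivious play
        reward = arms[a] @ theta_star + noise_std * rng.standard_normal()
        X.append(arms[a])
        y.append(reward)
    theta_hat, *_ = np.linalg.lstsq(np.asarray(X), np.asarray(y), rcond=None)
    return int(np.argmax(arms @ theta_hat))          # index of the estimated best arm
```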
Nearly Tight Bounds for Exploration in Streaming Multi-armed Bandits with Known Optimality Gap
We investigate the sample-memory-pass trade-offs for pure exploration in multi-pass streaming multi-armed bandits (MABs) with *a priori* knowledge of the optimality gap $\Delta_{[2]}$. Here, and throughout, the optimality gap $\Delta_{[i]}$ is defined as the gap in mean reward between the best and the $i$-th best arms. A recent line of results by Jin, Huang, Tang, and Xiao [ICML'21] and Assadi and Wang [COLT'24] has shown that when $\Delta_{[2]}$ is not known in advance, a pass complexity of $\Theta(\log(1/\Delta_{[2]}))$ (up to $\log\log(1/\Delta_{[2]})$ terms) is necessary and sufficient to obtain the *worst-case optimal* sample complexity of $O(n/\Delta^{2}_{[2]})$ with a single-arm memory. However, our understanding of multi-pass algorithms with known $\Delta_{[2]}$ is still limited. Here, the key open problem is how many passes are required to achieve a sample complexity of $O(\sum_{i=2}^{n}1/\Delta^2_{[i]})$ arm pulls with a sublinear memory size. In this work, we show that the ``right answer'' to this question is $\Theta(\log{n})$ passes (up to $\log\log{n}$ terms). We first present a lower bound, showing that any algorithm that finds the best arm with slightly sublinear memory (a memory of $o({n}/{\text{polylog}({n})})$ arms) and $O(\sum_{i=2}^{n}{1}/{\Delta^{2}_{[i]}}\cdot \log{(n)})$ arm pulls has to make $\Omega(\frac{\log{n}}{\log\log{n}})$ passes over the stream. We then show a nearly matching algorithm that, assuming knowledge of $\Delta_{[2]}$, finds the best arm with $O(\sum_{i=2}^{n}1/\Delta^2_{[i]} \cdot \log{n})$ arm pulls and a *single-arm* memory.
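For intuition only, the sketch below (hypothetical Python, not the paper's multi-pass algorithm) shows the simplest way a known gap $\Delta_{[2]}$ is typically exploited in a single pass with a single-arm memory: each arriving arm duels the current champion with a Hoeffding-sized budget. This attains only the worst-case order of $n/\Delta^{2}_{[2]}$ pulls (up to logarithmic factors), not the instance-sensitive bound targeted by the multi-pass question above.

```python
import math

def stream_best_arm(arm_stream, gap, delta=0.05):
    # Hypothetical single-pass, single-arm-memory "duel" sketch with a known
    # optimality gap `gap`; each arm in `arm_stream` is a callable returning a
    # reward in [0, 1]. Each comparison is sized via Hoeffding's inequality to
    # resolve a mean difference of `gap` with probability >= 1 - delta; a union
    # bound over the whole stream would use delta / n instead.
    budget = math.ceil(4 * math.log(2 / delta) / gap ** 2)
    champion = None
    for arm in arm_stream:
        if champion is None:
            champion = arm
            continue
        champ_mean = sum(champion() for _ in range(budget)) / budget
        chall_mean = sum(arm() for _ in range(budget)) / budget
        if chall_mean > champ_mean:                  # challenger wins the duel
            champion = arm
    return champion
```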
UCB algorithms for multi-armed bandits: Precise regret and adaptive inference
Qiyang Han, Koulik Khamaru, and Cun-Hui Zhang
Upper Confidence Bound (UCB) algorithms are a widely used class of sequential algorithms for the $K$-armed bandit problem. Despite extensive research over the past decades aimed at understanding their asymptotic and (near) minimax optimality properties, a precise understanding of their regret behavior remains elusive. This gap has not only hindered the evaluation of their actual algorithmic efficiency, but also limited further developments in statistical inference for sequentially collected data. This paper bridges these two fundamental aspects, precise regret analysis and adaptive statistical inference, through a deterministic characterization of the number of arm pulls for a UCB index algorithm [Lai87, Agr95, ACBF02]. Our resulting precise regret formula not only accurately captures the actual behavior of the UCB algorithm for finite time horizons and individual problem instances, but also provides significant new insights into the regimes in which the existing theory remains informative. In particular, we show that the classical Lai-Robbins regret formula is exact if and only if the sub-optimality gaps exceed the order $\sigma\sqrt{K\log T/T}$. We also show that its maximal regret deviates from the minimax regret by a logarithmic factor, thereby settling the question of its strict minimax optimality in the negative. The deterministic characterization of the number of arm pulls for the UCB algorithm also has major implications for adaptive statistical inference. Building on the seminal work of [Lai82], we show that the UCB algorithm satisfies certain stability properties that lead to quantitative central limit theorems in two settings, including the empirical means of the unknown rewards in the bandit setting. These results have an important practical implication: conventional confidence sets designed for i.i.d. data remain valid even when data are collected sequentially.
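For concreteness, below is a minimal sketch of the classical UCB1 index rule of [ACBF02] for rewards in $[0,1]$; the paper analyzes a general family of UCB index algorithms, so this is only an illustrative instance with hypothetical function names.

```python
import math
import random

def ucb1(pull, K, T):
    # Minimal sketch of the UCB1 index rule [ACBF02] for rewards in [0, 1]:
    # after pulling each arm once, play the arm maximizing
    # empirical mean + sqrt(2 log t / n_a).
    counts = [0] * K
    sums = [0.0] * K
    for t in range(1, T + 1):
        if t <= K:
            arm = t - 1                              # initialization: one pull per arm
        else:
            arm = max(range(K), key=lambda a: sums[a] / counts[a]
                      + math.sqrt(2.0 * math.log(t) / counts[a]))
        sums[arm] += pull(arm)
        counts[arm] += 1
    return counts, [s / c for s, c in zip(sums, counts)]

# Example: two Bernoulli arms with means 0.6 and 0.5.
means = [0.6, 0.5]
counts, estimates = ucb1(lambda a: float(random.random() < means[a]), K=2, T=10_000)
```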
A General Framework for Clustering and Distribution Matching with Bandit Feedback
Recep Can Yavas, Yuqi Huang, Vincent Y. F. Tan, and Jonathan Scarlett
We develop a general framework for clustering and distribution matching problems with bandit feedback. We consider a $K$-armed bandit model in which some subset of the $K$ arms is partitioned into $M$ groups. Within each group, the random variable associated with each arm follows the same distribution on a finite alphabet. At each time step, the decision maker pulls an arm and observes an outcome drawn from the random variable associated with that arm. Subsequent arm pulls depend on the history of arm pulls and their outcomes. The decision maker has no knowledge of the distributions of the arms or of the underlying partition. The task is to devise an online algorithm that learns the underlying partition of the arms with the least number of arm pulls on average and with an error probability not exceeding a pre-determined value $\delta$. Several existing problems, including finding $M$ pairs of arms, odd arm identification, and $M$-ary clustering of $K$ arms, fall under our general framework. We derive a non-asymptotic lower bound on the average number of arm pulls for any online algorithm with an error probability not exceeding $\delta$. Furthermore, we develop a computationally efficient online algorithm based on the Track-and-Stop method and the Frank--Wolfe algorithm, and show that the average number of arm pulls of our algorithm asymptotically matches the lower bound. Our refined analysis also uncovers a novel bound on the speed at which the average number of arm pulls of our algorithm converges to the fundamental limit as $\delta$ vanishes.
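Since the abstract names the Track-and-Stop method, the sketch below (hypothetical Python, with caller-supplied `weights_fn`, `glr_fn`, and `threshold_fn`) illustrates only the generic Track-and-Stop control loop with D-tracking-style forced exploration; the paper's actual allocation optimization (via Frank--Wolfe) and its stopping statistic for the clustering problem are not reproduced here.

```python
import math

def track_and_stop(pull, K, weights_fn, glr_fn, threshold_fn, delta=0.05, horizon=10**6):
    # Hypothetical sketch of a generic Track-and-Stop loop, not the paper's
    # algorithm. The caller supplies:
    #   weights_fn(counts, sums) -> target sampling proportions (length K),
    #   glr_fn(counts, sums)     -> a generalized likelihood-ratio statistic,
    #   threshold_fn(t, delta)   -> the stopping threshold beta(t, delta).
    counts = [0] * K
    sums = [0.0] * K
    for t in range(1, horizon + 1):
        starved = [a for a in range(K) if counts[a] < math.sqrt(t) - K / 2]
        if starved:                                   # forced exploration
            arm = min(starved, key=lambda a: counts[a])
        else:                                         # track the target proportions
            w = weights_fn(counts, sums)
            arm = max(range(K), key=lambda a: t * w[a] - counts[a])
        sums[arm] += pull(arm)
        counts[arm] += 1
        if glr_fn(counts, sums) > threshold_fn(t, delta):
            break                                     # confident enough: stop sampling
    return counts, sums                               # caller forms the answer from these
```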